Field of the invention:

This invention relates to a computer-based method for identifying peptides useful as drug
targets. More particularly, this invention relates to a method for identification of invariant peptide
motifs in protein sequence data of various organisms useful as potential drug targets. This invention
further provides a method for assignment of function to hypothetical Open Reading Frames
(proteins) of unknown function through exact amino acid sequence identity signature.

This invention provides a novel approach for identifying structural and functional signatures
of conserved invariant amino acid sequences of proteins that can serve as potential candidates for
drug targets. Emergence of drug resistant strains has necessitated identification of new drugs and
drug targets. Unique invariant peptide motifs present in the proteins of a pathogen but absent in the
proteins of the host indicate potential drug targets. The invention also provides a method for
genome-wise comparison of a large number of protein sequences simultaneously. Yet another utility is
identifying peptide sequences useful for specific diagnosis of infections.

Background of the invention: 

It is known that most of the drugs available today to cure infections bind to specific
protein target molecules in the cell of the causative organism; e.g., several antibiotics are known to
disrupt the function of ribosomes so that protein translation is affected. In these cases it has
been found that the drugs bind either to the ribosomal RNA directly or to RNA-protein complexes
(Wimberly et al., 1999). Chemical probing experiments have revealed that the drug binds to certain
nucleotide sequences of ribosomal RNA that are 'invariant' in structurally analogous regions in
different organisms (Porse and Garrett, 1999). The other class of drugs serves to block other
functions such as transcription (Cutler et al., 1999) or fatty acid synthesis in the bacterial cell
(McCafferty et al., 1999).

Recently, several drug resistant strains (Ghannoum & Rice, 1999) of pathogenic bacteria
have emerged that render the current treatment procedures ineffective in curing infections due to
bacterial pathogens. This necessitates the identification of new drug targets and the corresponding
drugs. For this purpose, the availability of complete genome sequences from various microbes
offers us an opportunity to analyze all the proteins encoded in a given genome. Since most drugs
known today target proteins, it is likely that analyzing all the proteins in a given bacterium
may provide new valid drug targets.

The knowledge of conserved invariant sequences in a protein can be useful in understanding
certain features of a protein's architecture, such as buried versus exposed location of a segment or
the presence of specific secondary structural elements (Rooman and Wodak, 1988; Presnell et al.,
1992). The protein's functional role is the most important aspect of conserved invariant sequences.
Usual methods of sequence analysis include BLAST (Altschul et al., 1990) and FASTA (Wilbur
and Lipman, 1983). These methods carry out sequence alignments whose quality is evaluated using
an amino acid substitution matrix. Statistical calculations are performed and the results are output
in a ranked manner, with the most similar sequence ranking first. However, these methods are not
designed to do a genome-wise comparison simultaneously to identify invariant sequence motifs that
are of particular importance in this work.

In order to compare each protein of one organism with all other proteins of several other 
organisms, either one has to use BLAST one by one or a batch BLAST has to be used which is 
highly time consuming and therefore not practicable. Even if this were done, at the end of the 
exercise, one would obtain the overall similarity of a set of homologous proteins and alignments. 

The problem with multiple sequence alignment is that it is biased by the selection of
proteins. Only proteins that are functionally related will give a clear picture of any relationship
between the selected proteins. Such procedures are labor intensive and time consuming and lead to
results that need further processing and filtering. Moreover, by these methods it is not possible to
compare all proteins of several organisms and retrieve conserved invariant peptides.

The present invention provides a novel computer-based method to look for invariant
sequence motifs, which lends itself to the manifold uses mentioned above and obviates the drawbacks
listed above.

The applicants' approach is based on the paradigm that the invariant sequence motifs
between the different bacterial proteins must play an important role in the structure
and the function of the protein. Of the numerous ways by which drug targets can be identified, we
have taken an approach based on comparative and structural genomics. In this case, the invariant
sequence motifs may be either directly or indirectly involved in the function of the subject protein
molecule. This approach is derived from the concept that invariant sequence motifs that have
remained unchanged across bacteria that are related either distantly or closely should have evolved
a unique structural feature that cannot be compromised. Indeed, it is even possible that the
so-called conservative substitutions are also not tolerated in these invariant sequence motifs. To this
end, we have identified several invariant peptide motifs by direct sequence comparison between
various bacterial genomes without any a priori assumptions. This purely unbiased and assumption-free
way of studying the sequences has the benefit of revealing unidentified sequence properties in the
various genomes.

Since the invariant sequence motifs may be important for the function of the subject protein 
molecule, we aim to develop these peptide motifs as potential broad-spectrum antibacterial drug 
targets. It is probable that a small molecule that can bind specifically to these invariant sequences 
may cause disruption of function of the subject protein molecule. It is envisaged that this in silico
approach will provide new leads for experimental validation to derive functions from protein 
sequences existing in the available databases. 

Objects of the invention: 

The main object of the present invention is to provide a method for genome-wise protein 
sequence comparison of several organisms and identification of invariant conserved peptides. 

Another object of the present invention relates to a novel computer based method for 
performing genome-wise comparison of several organisms, wherein the said computational method 
involves creation of peptide libraries from protein sequences of several organisms and subsequent
comparison leading to identification of conserved invariant peptide motifs. 

Yet another object of the present invention relates to providing a method useful for 
identification of potential drug targets and can serve as drug screen for broad spectrum 
antibacterials as well as for specific diagnosis of infection. 

Another object of the present invention is to assign suitable function to proteins of yet 
unknown functions. 

Yet another object is to provide a computational method incorporating the invariant 
peptides or their analogs for identifying potential drug targets. 
Summary of the Invention:

The applicants have invented a method to identify invariant peptide motifs, obtained from
millions of peptides present in protein sequences of many organisms, that have withstood natural
selection. These sequences are thus structural determinants of proteins, which could be targeted or
used as a screen for drug discovery. These special invariant peptide signatures are
also found to be associated with special functional classes of proteins.

The present method will also allow predicting toxicity, alternate target in host cell for drug 
targeted against a specific peptide motif of a pathogenic organism or any host protein target 
responsible for a disease process. The method could be extended with lower stringencies to larger 
number of proteins and also to eukaryotes and multicellular organisms.

Other and further aspects, features and advantages of the present invention will be apparent 
from the following description of the presently preferred embodiments of the invention given for 
the purpose of disclosures. 

Brief description of the computer programs:

1. PEPLIB

Objective: To create peptide libraries of organisms from their FASTA format protein files. Overlapping
peptides of user defined length are generated, and then only non-redundant peptides are
arranged alphabetically in the output file.

Programming language: PERL on IRIX platform.

2. PEPLIMP

Objective: This program compares the peptide libraries of organisms selected by the user and
returns the peptide sequences that are common across the genomes.

Programming language: PERL on IRIX platform.

3. PEPXTRACT

Objective: This program takes a peptide file as input, searches in the FASTA format protein files
(pep files) and returns the details about the peptides. The details include the PID, the location of the
peptide in the protein, the organism name, etc.

Programming language: PERL on IRIX platform.

4. PEPSTITCH

Objective: This program joins the peptides depending on certain fixed criteria (the two peptides
should have the same PID and their locations should be adjacent), removes overlaps and
reports all the conserved invariant peptides.

Programming language: PERL on IRIX platform.

Details of the invention: 

Theoretically speaking, though a huge number of combinations are possible at the amino acid
level to form a peptide of a given length, only a limited fraction has been observed in biological
systems. Out of this limited fraction, only a few peptides have remained invariant across the genomes of
different organisms. In this work, we sought to answer the question pertaining to the nature of
peptides that are invariant across all the pathogenic and nonpathogenic bacterial genomes.
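
As a rough illustration of the scale involved, the following Python sketch (an illustrative addition, not part of the original PERL programs) compares the theoretical number of 8-residue peptides over the 20 standard amino acids with the unique 8-mer counts reported for two genomes in the table given later in this specification. The function name and the choice of N = 8 are assumptions made only for this example.

```python
# Illustrative arithmetic only; the function name is hypothetical.

def theoretical_peptides(n: int) -> int:
    """Number of distinct peptides of length n over the 20 standard amino acids."""
    return 20 ** n

# Unique 8-mer counts as reported in the 8-mer table of this specification.
observed_8mers = {
    "Escherichia coli": 1302149,
    "Mycoplasma genitalium": 165523,
}

total = theoretical_peptides(8)  # 20**8 = 25,600,000,000

# Fraction of the theoretical 8-mer space actually observed in each genome.
fractions = {name: count / total for name, count in observed_8mers.items()}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2e} of all possible 8-mers")
```

Under these assumptions each genome occupies well under one ten-thousandth of the theoretical 8-mer space, which is the sense in which only a limited fraction is observed.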

In the present invention it has been shown that a stretch of amino acid conservation in 
proteins of various organisms can provide accurate distinction between different classes of proteins. 
Generally, these proteins are identified as proteins having very basic function in the survival of the
organism.

The protein sequences of several organisms were obtained computationally from the
existing databases (NCBI, genbank/genomes/bacteria). These were then chopped computationally
into peptide fragments of 'N' amino acid residues by a specially developed computer program,
PEPLIB. A library of peptides of length 'N' was created for all the proteins of each organism by
sliding the window of length 'N' along the sequence by one residue at a time. The peptides thus
obtained were computationally sorted in alphabetical order according to the single letter amino acid
code, and the redundancy was removed by deleting duplicated peptides. The peptide libraries
of various organisms were then compared computationally to find common peptides. The
comparison was done using a specially developed computer program labeled PEPLIMP. The
common peptides were located computationally in the original proteins using the PEPXTRACT
program and were subsequently labeled with their proteins of origin and location. These common
peptides were backstitched computationally to form a long chain of common peptides. This was
done using the PEPSTITCH program.

These fragments of common peptides thus obtained were termed as invariant peptides as 
they originated from functionally conserved proteins. All the conserved invariant peptides obtained 
from the same protein were then clustered into one group. The secondary structure of these peptides 
was validated from the protein crystal structure database, namely the Protein Data Bank (PDB).

Accordingly the invention provides a computer-based method for identifying invariant
peptide motifs useful as drug targets wherein the said method comprises the steps of:

i) generating computationally overlapping peptide libraries from all the protein sequences of the
selected organisms available at http://www.ncbi.nlm.nih.gov,

ii) sorting computationally the peptides of length 'N' obtained as above, alphabetically, according
to single letter amino acid code,

iii) matching computationally common peptide sequences of the selected bacteria,

iv) locating computationally these common peptides in the original proteins and subsequently
labeling them with their origin and location,

v) joining computationally the overlapping common peptides to obtain a long chain of invariant
peptide sequences,

vi) annotating secondary structure of these conserved peptides from the crystal structure database, 

vii) comparing pathogenic strain genomes against genomes of non-pathogenic strains and selecting 
the sequences not commonly conserved in these two groups, 

viii) validating computationally the invariant sequence motifs as potential drug target sequence by 
searching for the given conserved sequences in the host genome and rejecting the ones present in 
the host genome. 
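
The final two steps above can be sketched in Python as simple set operations. This is a minimal sketch under stated assumptions, not the applicants' implementation: it interprets step (vii) as keeping peptides that are invariant among the pathogens but not among the non-pathogens, it treats the host check of step (viii) as exact membership in a host peptide library of the same length 'N', and the function and variable names are hypothetical.

```python
# Hypothetical helper illustrating steps (vii) and (viii) with toy data.

def candidate_targets(pathogen_invariant, nonpathogen_invariant, host_library):
    """Return peptides invariant among pathogens, absent from the
    non-pathogen invariant set, and absent from the host peptide library."""
    host = set(host_library)
    # Step (vii), as interpreted here: drop peptides also conserved in non-pathogens.
    pathogen_specific = set(pathogen_invariant) - set(nonpathogen_invariant)
    # Step (viii): reject any peptide that occurs in the host.
    return sorted(p for p in pathogen_specific if p not in host)

# Toy peptides, not taken from any real genome.
pathogen_invariant = ["AAAA", "CCCC", "DDDD"]
nonpathogen_invariant = ["CCCC"]
host_library = ["DDDD", "EEEE"]

targets = candidate_targets(pathogen_invariant, nonpathogen_invariant, host_library)
print(targets)
```

In this toy case only AAAA survives both filters: CCCC is also conserved among the non-pathogens and DDDD occurs in the host library.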
In an embodiment of the present invention, the length 'N' of the sliding window
may range from 4 to any length of amino acid residues.

In another embodiment of the present invention, the protein sequence data may be taken
from any organism but not specifically limited to microbes such as Mycoplasma pneumoniae,
Helicobacter pylori, Haemophilus influenzae, Mycobacterium tuberculosis, Mycoplasma
genitalium, Bacillus subtilis, Escherichia coli.



In a further embodiment, the conserved peptide motifs as identified comprise:

1. AAQSIGEPGTQLT
2. AGDGTTTAT
3. AGRHGNKG
4. AHIDAGKTTT
5. CPIETPEG
6. DEPSIGLH
7. DEPTSALD
8. DEPTTALDVT
9. DHAGIATQ
10. DHPHGGGEG
11. DLGGGTFD
12. DVLDTWFSS
13. ERERGITI
14. ERGITITSAAT
15. ESRRIDNQLRGR
16. FSGGQRQR
17. GEPGVGKTA
18. GFDYLRDN
19. GHNLQEHS
20. GIDLGTTNS
21. GINLLREGLD
22. GIVGLPNVGKS
23. GKSSLLNA
24. GLTGRKIIVDTYG
25. GPPGTGKTLLA
26. GPPGVGKT
27. GSGKTTLL
28. GTRIFGPV
29. IDTPGHVDFT
30. IIAHIDHGKSTL
31. INGFGRIGR
32. IREGGRTVG
33. IVGESGSGKS
34. KFSTYATWWI
35. KMSKSKGN
36. KMSKSLGN
37. KNMITGAAQMDGAILVV
38. KPNSALRK
39. LFGGAGVGKTV
40. LGPSGCGK
41. LHAGGKFD
42. LIDEARTPLIISG
43. LLNRAPTLH
44. LPDKAIDLIDE
45. LPGKLADC
46. LSGGQQQR
47. MGHVDHGKT
48. NADFDGDQMAVH
49. NGAGKSTL
50. NLLGKRVD
51. NTDAEGRL
52. PSAVGYQPTLA
53. QRVAIARA
54. QRYKGLGEM
55. RDGLKPVHRR
56. SALDVSIQA
57. SGGLHGVG
58. SGSGKSSL
59. SGSGKSTL
60. SVFAGVGERTREGND
61. TGRTHQIRVH
62. TGVSGSGKS
63. TLSGGEAQRI
64. TNKYAEGYP
65. TPRSNPATY
66. VEGDSAGG
67. VRKRPGMYIG

In yet another embodiment of the present invention, the number of invariant peptides may
vary according to the relatedness among the organisms and the number of organisms being
compared.

In still another embodiment, the invariant sequences may belong to the following proteins as
available in the database http://www.ncbi.nlm.nih.gov, wherein the said list of proteins
comprises:

I DNA DIRECTED RNA POLYMERASE BETA CHAIN
II EXCINUCLEASE ABC SUBUNIT A
III EXCINUCLEASE ABC SUBUNIT B
IV DNA GYRASE SUBUNIT B
V ATP SYNTHASE BETA CHAIN
VI S-ADENOSYLMETHIONINE SYNTHETASE
VII GLYCERALDEHYDE 3-PHOSPHATE DEHYDROGENASE
VIII ELONGATION FACTOR G (EF-G)
IX ELONGATION FACTOR TU (EF-TU)
X 30S RIBOSOMAL PROTEIN S12
XI 50S RIBOSOMAL PROTEIN L12
XII 50S RIBOSOMAL PROTEIN L14
XIII VALYL tRNA SYNTHETASE (VALRS)
XIV CELL DIVISION PROTEIN FtsH HOMOLOG
XV DnaK PROTEIN (HSP70)
XVI GTP BINDING PROTEIN LepA
XVII TRANSPORTER
XVIII OLIGOPEPTIDE TRANSPORT ATP BINDING PROTEIN OPPF

In still another embodiment of the present invention, the said method of comparing the
peptide libraries as given in step (iii) of the method explained above is carried out by following the
steps given in figure 1.

In yet another embodiment of the present invention, the said method of locating the
common peptides in the original protein sequences as given in step (iv) of the method explained above is
carried out by following the steps given in figure 2.

In another embodiment, the method of creating a common peptide of variable length after
removing the overlaps as given in step (v) of the method explained above is carried out by
following the steps given in figure 3.

In another embodiment to the present invention, the microprocessor based system for 
performing the methods of the invention comprises; 

i) means of determining the amino acid sequence window for creation of peptide library and 
subsequent sorting, 

ii) means of comparing the peptide library, 

iii) locating computationally these common peptides in the original proteins and subsequently 
labeling them with their origin and location, 

iv) joining computationally the overlapping common peptides to obtain a long chain of invariant 
peptide sequences. 

In another embodiment of the invention, the computer system for performing the methods
of the invention comprises a central processing unit executing the peptide library creating program
(PEPLIB), peptide library matching program (PEPLIMP), peptide stitching program
(PEPSTITCH) and peptide extraction program (PEPXTRACT), wherein the said programs are all
stored in a memory device accessed by the central processing unit, which is connected to a display on which
the central processing unit displays the screens of the above mentioned programs in response to
user inputs with a user interface device.

In yet another embodiment of the present invention, the method for assigning function to a
protein of unknown function showing no/weak homology to other protein sequences in a publicly
available database (SWISSPROT) may be carried out by employing the following steps:

I. generating computationally overlapping peptide library from the protein sequences
of unknown function,

II. sorting computationally the peptides of length 'N' (N is the length of the sliding
window of amino acids) obtained as above, alphabetically, according to single letter
amino acid code,

III. matching computationally the current library with peptide library of all functionally
known proteins to obtain common peptides,

IV. locating computationally these common peptides in the original proteins and
subsequently labeling them with their origin and location,

V. joining computationally the overlapping common peptides to obtain a long chain of
invariant peptide sequences,

VI. assigning function to the unknown protein based on the function of the protein with
which maximum length of peptide sequence identity is found. The greater the
number of matches with proteins of similar function, the higher the likelihood of the
functional assignment.

The particulars of the organisms such as their name, strain, accession number and other
details are given below.

Genome                      Strain  Accession  Total Base   Date of
                                    Number     Sequences    Completion

Mycobacterium tuberculosis  H37Rv   AL123456   4411529 bp   Jun 11, 1998
    Cole, S.T., et al. Nature 393 (6685), 537-544 (1998)

Bacillus subtilis           DY      AL009126   4214814 bp   Nov 20, 1997
    Kunst, F., et al. Nature 390 (6657), 249-256 (1997)

Mycoplasma genitalium       G37     L43967     580074 bp    Oct 30, 1995
    Fraser, C.M., et al. Science 270 (5235), 397-403 (1995)

Mycoplasma pneumoniae       M129    U00089     816394 bp    Nov 15, 1996
    Himmelreich, R., et al. Nucleic Acids Res. 24 (22), 4420-4449 (1996)

Escherichia coli            K-12    U00096     4639221 bp   Oct 13, 1998
    Blattner, F.R., et al. Science 277 (5331), 1453-1474 (1997)

Helicobacter pylori         26695   AE000511   1667867 bp   Aug 6, 1997
    Tomb, J.-F., et al. Nature 388 (6642), 539-547 (1997)

Haemophilus influenzae      Rd      L42023     1830138 bp   Jul 25, 1995
    Fleischmann, R.D., et al. Science 269 (5223), 496-512 (1995)



Genome                      Proteins  Number of 8-mer  No. of proteins in which
                                      peptides         common peptides are found

Bacillus subtilis           4100      1174826          69
Escherichia coli            4289      1302149          81
Haemophilus influenzae      1709      504044           56
Helicobacter pylori         1566      474087           51
Mycoplasma genitalium       467       165523           30
Mycoplasma pneumoniae       677       221216           43
Mycobacterium tuberculosis  3918      1252582

Brief description of the accompanying drawings:

Figure 1 shows a logic circuit of the Peptide Library Matching Program.
Figure 2 shows a logic circuit of the Peptide Extraction Program.
Figure 3 shows a logic circuit of the Peptide Stitching Program.

Figure 4 shows crystal structures of three invariant peptides (VRKRPGMYIG, LHAGGKFD and
SGGLHGVG) from DNA gyrase B protein.

The invention is explained with the help of the following examples and should not be 
construed to limit the scope of the present invention. 
Examples

Example 1

The peptide library creation program (PEPLIB)

The purpose of the program is to create a non-redundant peptide library of user specified window
length for a given genome by sliding the window by one amino acid residue at a time.
The program works as follows:

The FASTA format files downloaded from
http://www.ncbi.nlm.nih.gov are saved under the name <organism_name>.pep and passed as
input to the PERL program, which creates unique peptides of the length specified at the time of
execution.

Input / Output file format:

Downloaded files and their format:

<organism_name>.pep : file which stores the annotation and the protein sequence

<organism_name> refers to

Tb (Mycobacterium tuberculosis), Bs (Bacillus subtilis), Mg (Mycoplasma genitalium), Mp
(Mycoplasma pneumoniae), Ec (Escherichia coli), Hp (Helicobacter pylori), Hi (Haemophilus
influenzae)

Format: FASTA
">gi|"<annotation>
<the entire protein sequence>

For example,

>gi|2808711|emb|CAA16238.1| dnaA
MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQRAWLNLVQPLTIVEGF
ALLSVPSSFVQNEIERHLRAPITDALSRRLGHQIQLGVRIAPPATDEADDTTVPPSENPATTS
PDTTTDNDRIDDSAAARGDNQHSWP
>gi|3261513|emb|CAA16239.1| dnaN
MDAATTRVGLTDLTFRLLRESFADAVSWVAKNLPARPAVPVLSGVLLTGSDNGLTISGFD
YEVSAEAQVGAEIVSPGSVLVSGRLtSDITRALPNKPVDVHVEGNRVALTCGNARFSLPTM
PVEDYPTLPTLPEETGLLPAE



The output file: <organism_name><peptide_length>.txt
Format:

<all unique peptides of length specified at the time of execution>
For example, the format of Tb8.txt:
AAAAAAAA 
AAAAAAAG 
AAAAAAAQ 
AAAAAAAS 
AAAAAAAT 
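
The library creation step described in this example can be sketched as follows. The original program was written in PERL; this Python version is only a minimal illustration under stated assumptions: the function names `read_fasta` and `peptide_library` are hypothetical, the toy protein records are invented, and writing the result to a file such as Tb8.txt is omitted.

```python
# Hypothetical sketch of a PEPLIB-like step: sliding window, de-duplication, sorting.

def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def peptide_library(fasta_text, n):
    """Return the sorted, non-redundant peptides of length n from all proteins."""
    peptides = set()
    for _, sequence in read_fasta(fasta_text):
        # Slide a window of length n along the sequence, one residue at a time.
        for i in range(len(sequence) - n + 1):
            peptides.add(sequence[i:i + n])
    return sorted(peptides)

# Two invented toy proteins that share most of their sequence.
example = ">gi|1| toy protein A\nMKTAYIAK\n>gi|2| toy protein B\nKTAYIAKQ\n"
library = peptide_library(example, 4)
print("\n".join(library))
```

A set is used here so that duplicated peptides are discarded as they are generated; sorting the set then gives the alphabetical order of the output file.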



Example 2

The peptide library matching program (PEPLIMP)

The purpose of the program is to compare the user defined peptide libraries with each other and
report the common/unique peptides. The output files of the program PEPLIB are used as input for
the PEPLIMP program. As the program is executed, the user is prompted to select the libraries that
are to be compared. Depending upon the libraries selected, an output file is generated containing the
common peptides (Fig 1). Comparison of the 8-mer peptide libraries of the above mentioned seven
organisms resulted in 164 eight-mer peptides.
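
A library comparison of this kind can be sketched as a set intersection. The Python below is an illustrative stand-in for the PERL program, not a reproduction of the logic in figure 1; the function name `common_peptides`, the dictionary layout and the toy libraries are assumptions.

```python
# Hypothetical sketch of a PEPLIMP-like comparison using set intersection.

def common_peptides(libraries):
    """Return the peptides present in every selected library, sorted alphabetically.

    libraries maps an organism label (e.g. "Tb") to its list of peptides.
    """
    sets = [set(peptides) for peptides in libraries.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))

# Toy libraries, not real genome data.
toy_libraries = {
    "Tb": ["AAAA", "GKST", "SGKS"],
    "Ec": ["GKST", "SGKS", "WWWW"],
    "Hp": ["GKST", "MMMM"],
}

shared = common_peptides(toy_libraries)
print(shared)
```

Selecting a different subset of organisms only requires passing a smaller dictionary, which mirrors the user's choice of libraries at run time.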

Comparison of four pathogenic organisms, namely Mycobacterium tuberculosis,
Helicobacter pylori, Mycoplasma pneumoniae and Haemophilus influenzae, resulted in 206 invariant
peptides, and comparison of three non-pathogenic organisms, namely Bacillus subtilis, Mycoplasma
genitalium and Escherichia coli, resulted in 601 invariant peptides. The comparison tree looks like:

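[Comparison tree diagram over the libraries Tb, Ec, Bs, Hp, Hi, Mg and Mp. The node counts that
remain legible are 5815 and 1767; the remaining values are not recoverable.]
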
Example 3

The peptide extraction program (PEPXTRACT)

This program takes the output of the PEPLIMP program, i.e., all the invariant peptides, as input and
locates these peptides in the protein sequences from the original database and labels them with the
protein identification number (PID), location and organism name for further analysis. The logic
circuit of this program is explained in the flow chart shown in figure 2.
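
The extraction step can be sketched as a substring search over each protein. This Python sketch is illustrative only and is not the flow chart of figure 2; the function name `locate_peptides`, the use of 1-based positions, the tuple layout and the toy sequences are assumptions.

```python
# Hypothetical sketch of a PEPXTRACT-like step: locate each peptide in each protein.

def locate_peptides(peptides, proteins, organism):
    """Return (peptide, pid, start, organism) for every occurrence found.

    proteins maps a protein identifier (PID) to its sequence;
    start is the 1-based position of the peptide in that sequence.
    """
    hits = []
    for pid, sequence in proteins.items():
        for peptide in peptides:
            index = sequence.find(peptide)
            while index != -1:
                hits.append((peptide, pid, index + 1, organism))
                # Continue searching after this hit to catch repeated occurrences.
                index = sequence.find(peptide, index + 1)
    return hits

# Toy proteins, not real database entries.
toy_proteins = {"P1": "MSGSGKSTLA", "P2": "AAGKSTGKST"}
hits = locate_peptides(["GKST"], toy_proteins, "Toy organism")
for hit in hits:
    print(hit)
```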

Example 4 

The peptide stitching program (PEPSTITCH) 

This program removes the overlaps between invariant peptides and reports every
continuous stretch of invariant peptide present in the protein under consideration. This is done by
first grouping the 'N'-mer peptides from the same protein of an organism and then, keeping track of
their locations, merging them into a single long peptide. The logic circuit of this program is
shown in figure 3.
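
The stitching idea can be sketched as merging windows whose start positions are consecutive within one protein. This Python sketch is a minimal illustration, not the logic of figure 3: it assumes that "adjacent" means start positions differing by exactly one residue, that all input peptides have the same length, and the function name `stitch`, the tuple layout and the toy positions are hypothetical.

```python
from collections import defaultdict

# Hypothetical sketch of a PEPSTITCH-like step.

def stitch(hits):
    """Merge adjacent fixed-length peptides from the same protein.

    hits is an iterable of (pid, start, peptide) with 1-based start positions.
    Returns a list of (pid, start, stitched_peptide).
    """
    by_protein = defaultdict(set)
    for pid, start, peptide in hits:
        by_protein[pid].add((start, peptide))

    stitched = []
    for pid in sorted(by_protein):
        run_start, run_seq, previous = None, "", None
        for start, peptide in sorted(by_protein[pid]):
            if previous is not None and start == previous + 1:
                # The next window overlaps all but its last residue: append only that.
                run_seq += peptide[-1]
            else:
                if run_seq:
                    stitched.append((pid, run_start, run_seq))
                run_start, run_seq = start, peptide
            previous = start
        if run_seq:
            stitched.append((pid, run_start, run_seq))
    return stitched

# Toy 4-mer hits at invented positions in one protein.
toy_hits = [("gyrB", 10, "SGGL"), ("gyrB", 11, "GGLH"), ("gyrB", 12, "GLHG"), ("gyrB", 30, "LHAG")]
merged = stitch(toy_hits)
print(merged)
```

Three consecutive 4-mers collapse into the single 6-residue stretch SGGLHG, while the isolated hit at position 30 is reported unchanged.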

Example 5 

Prediction of function of hypothetical protein 

An invariant peptide having sequence FSGGQRQR was found to exist in oppF/dppF proteins of
six organisms out of the seven examined (the exception being M. tuberculosis). This protein functions as
an ATP binding protein. Since this invariant peptide has also been found to be located on the
hypothetical protein encoded by the Rv1273c gene in M. tuberculosis, it is suggested that the protein
encoded by the Rv1273c gene must function as an ATP binding protein, as it holds the signature of this
class of protein.

Example 6

Prediction of function of hypothetical protein

Another invariant peptide having sequence GIVGLPNVGKS was found in proteins having GTP
binding function in six bacteria out of the seven examined (the exception being M. tuberculosis), whereas
the same invariant sequence is present in the hypothetical protein encoded by the Rv1112 gene in M.
tuberculosis. It is strongly suggested that this hypothetical protein may have GTP binding property
as it holds the signature of this class of protein.
Example 7

Drug target identification based on invariant peptide motifs

The enzyme DNA gyrase is known to reduce supercoiling of DNA. This protein is absent in humans and
has been considered as a potential drug target. However, the exact sequence to which the drug
molecules should be targeted is not yet clear. The peptides VRKRPGMYIG,
LHAGGKFD, SGGLHGVG, LPGKLADC, VEGDSAGG and QRYKGLGEM, which are
invariant across the DNA gyrase beta subunit of many pathogenic and non-pathogenic bacteria but
absent in the host, are structural determinants which could be used as potential drug targets against
bacterial infections. The crystal structures of three of these peptides are shown in fig 4.

Example 8 

Assignment of a function to a protein of unknown function 

With the help of this method one can assign function to a protein of unknown function 
showing no/weak homology to other protein sequences in a publicly available database 
(SWISSPROT) by employing the following steps: 

I. generating computationally overlapping peptide library from the protein sequences
of unknown function,

II. sorting computationally the peptides of length 'N' (N is the length of the sliding
window of amino acids) obtained as above, alphabetically, according to single letter
amino acid code,

III. matching computationally the current library with peptide library of all functionally
known proteins to obtain common peptides,

IV. locating computationally these common peptides in the original proteins and 
subsequently labeling them with their origin and location, 

V. joining computationally the overlapping common peptides to obtain a long chain of 
invariant peptide sequences, 

VI. assigning function to the unknown protein based on the function of the protein with
which maximum length of peptide sequence identity is found. The greater the
number of matches with proteins of similar function, the higher the likelihood of the
functional assignment.
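
The assignment rule of step VI can be sketched as follows. Note the substitution: instead of building, matching and stitching peptide libraries as in steps I to V, this Python sketch computes the longest exact shared stretch directly with a longest-common-substring dynamic program, which gives the same maximum identity length for a pair of sequences. The function names, the `min_len` threshold of 8 residues, the labels and the toy sequences are assumptions for illustration only.

```python
# Hypothetical sketch of function assignment by longest exact shared stretch.

def longest_shared_stretch(a, b):
    """Length of the longest exact common substring of sequences a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the identical run that ended at the previous residues.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best

def assign_function(unknown_sequence, annotated, min_len=8):
    """Return (label, length) for the annotated class with the longest identity.

    annotated maps a function label to a list of protein sequences.
    Returns (None, 0) when no stretch of at least min_len residues is shared.
    """
    best_label, best_length = None, 0
    for label, sequences in annotated.items():
        for sequence in sequences:
            length = longest_shared_stretch(unknown_sequence, sequence)
            if length >= min_len and length > best_length:
                best_label, best_length = label, length
    return best_label, best_length

# Invented toy sequences embedding two motifs from this specification.
annotated = {
    "ATP binding (toy)": ["KKFSGGQRQRIAIA"],
    "GTP binding (toy)": ["MGIVGLPNVGKSTL"],
}
result = assign_function("MAAFSGGQRQRVLLA", annotated)
print(result)
```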

Advantages:

1. The main advantage of the present invention is that it provides a new method of genome-wise
comparison of a large number (thousands) of proteins of one organism with proteins of other
organisms simultaneously to arrive at invariant peptide sequence motif signatures.

2. It provides a rapid method of identification of invariant peptide motifs.

3. It provides a simple and highly accurate method of determining invariant peptide motifs as it
does not involve any complex mathematical calculations.

4. It provides a basis for a screening assay for broad-spectrum antibacterial compounds. 

Claims:

We claim, 

1. A computer-based method for identifying invariant peptide motifs useful as drug targets 
wherein the said method comprises the steps of: 

i) generating computationally overlapping peptide libraries from all the protein sequences of the
selected organisms available at http://www.ncbi.nlm.nih.gov,

ii) sorting computationally the peptides of length 'N' obtained as above, alphabetically, according
to single letter amino acid code,

iii) matching computationally common peptide sequences of the selected bacteria, 

iv) locating computationally these common peptides in the original proteins and subsequently 
labeling them with their origin and location, 

v) joining computationally the overlapping common peptides to obtain a long chain of invariant
peptide sequences, 

vi) annotating secondary structure of these conserved peptides from the crystal structure database, 

vii) comparing pathogenic strain genomes against genomes of non-pathogenic strains and selecting 
the sequences not commonly conserved in these two groups, 

viii) validating computationally the invariant sequence motifs as potential drug target sequence by 
searching for the given conserved sequences in the host genome and rejecting the ones present in
the host genome. 

2. The method of claim 1 wherein the length of the sliding window of length 'N' ranges from 4 to 
any length of amino acid residues. 

3. The method of claim 1 wherein the protein sequence data is taken from any organism but not
specifically limited to microbes such as Mycoplasma pneumoniae, Helicobacter pylori,
Haemophilus influenzae, Mycobacterium tuberculosis, Mycoplasma genitalium, Bacillus subtilis,
Escherichia coli.

4. A method as claimed in claim 1 wherein the conserved peptide motifs as identified comprise:

1. AAQSIGEPGTQLT
2. AGDGTTTAT
3. AGRHGNKG
4. AHIDAGKTTT
5. CPIETPEG
6. DEPSIGLH
7. DEPTSALD
8. DEPTTALDVT
9. DHAGIATQ
10. DHPHGGGEG
11. DLGGGTFD
12. DVLDTWFSS
13. ERERGITI
14. ERGITITSAAT
15. ESRRIDNQLRGR
16. FSGGQRQR
17. GEPGVGKTA
18. GFDYLRDN
19. GHNLQEHS
20. GIDLGTTNS
21. GINLLREGLD
22. GIVGLPNVGKS
23. GKSSLLNA
24. GLTGRKIIVDTYG
25. GPPGTGKTLLA
26. GPPGVGKT
27. GSGKTTLL
28. GTRIFGPV
29. IDTPGHVDFT
30. IIAHIDHGKSTL
31. INGFGRIGR
32. IREGGRTVG
33. IVGESGSGKS
34. KFSTYATWWI
35. KMSKSKGN
36. KMSKSLGN
37. KNMITGAAQMDGAILVV
38. KPNSALRK
39. LFGGAGVGKTV
40. LGPSGCGK
41. LHAGGKFD
42. LIDEARTPLIISG
43. LLNRAPTLH
44. LPDKAIDLIDE
45. LPGKLADC
46. LSGGQQQR
47. MGHVDHGKT
48. NADFDGDQMAVH
49. NGAGKSTL
50. NLLGKRVD
51. NTDAEGRL
52. PSAVGYQPTLA
53. QRVAIARA
54. QRYKGLGEM
55. RDGLKPVHRR
56. SALDVSIQA
57. SGGLHGVG
58. SGSGKSSL
59. SGSGKSTL
60. SVFAGVGERTREGND
61. TGRTHQIRVH
62. TGVSGSGKS
63. TLSGGEAQRI
64. TNKYAEGYP
65. TPRSNPATY
66. VEGDSAGG
67. VRKRPGMYIG

5. A method as claimed in claim 1 wherein the number of invariant peptides varies according to 
the relatedness among the organisms and the number of organisms being compared. 

6. A method as claimed in claims 1 to 4 wherein the invariant sequences belong to the following 
proteins as available in the database http://www.ncbi.nlm.nih.gov wherein the said list of proteins 
comprises: 

I DNA DIRECTED RNA POLYMERASE BETA CHAIN 

II EXCINUCLEASE ABC SUBUNIT A 

III EXCINUCLEASE ABC SUBUNIT B 

IV DNA GYRASE SUBUNIT B 
V ATP SYNTHASE BETA CHAIN 

VI S-ADENOSYLMETHIONINE SYNTHETASE 

VII GLYCERALDEHYDE 3-PHOSPHATE DEHYDROGENASE 

VIII ELONGATION FACTOR G (EF-G) 

IX ELONGATION FACTOR TU (EF-TU) 

X 30S RIBOSOMAL PROTEIN S12 

XI 50S RIBOSOMAL PROTEIN L12 

XII 50S RIBOSOMAL PROTEIN L14 

XIII VALYL tRNA SYNTHETASE (VALRS) 

XIV CELL DIVISION PROTEIN FtsH HOMOLOG 

XV DnaK PROTEIN (HSP70) 

XVI GTP BINDING PROTEIN LepA 

XVII TRANSPORTER 

XVIII OLIGOPEPTIDE TRANSPORT ATP BINDING PROTEIN OPPF 

7. A method as claimed in claim 1 wherein the said method of comparing the peptide libraries as 
given in step (iii) of claim 1 is carried out by following the steps given in figure 1. 
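
Figure 1 is not reproduced in this section, so the following is only a sketch of one way two libraries could be compared. It assumes that each library has already been reduced to an alphabetically sorted list of distinct peptides of the same length, and it walks both lists in a single merge-style pass. The function name `common_peptides` and the toy libraries are hypothetical.

```python
def common_peptides(sorted_a, sorted_b):
    """Merge-style intersection of two alphabetically sorted, duplicate-free peptide lists."""
    i = j = 0
    common = []
    while i < len(sorted_a) and j < len(sorted_b):
        if sorted_a[i] == sorted_b[j]:
            common.append(sorted_a[i])
            i += 1
            j += 1
        elif sorted_a[i] < sorted_b[j]:
            i += 1  # peptide only in library A; advance A
        else:
            j += 1  # peptide only in library B; advance B
    return common


# Toy libraries, invented for illustration (assumption, not from the source).
lib_a = sorted({"GPPGVGKT", "LSGGQQQR", "AAAAAAAA"})
lib_b = sorted({"LSGGQQQR", "GPPGVGKT", "CCCCCCCC"})
print(common_peptides(lib_a, lib_b))
```

Because both inputs are sorted, each list is read once; comparing more than two organisms could be done by intersecting the running result with each further library in turn.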

8. A method as claimed in claim 1 wherein the said method of locating the common peptides in 
the original protein sequences as given in step (iv) of claim 1 is carried out by following the 
steps given in figure 2. 
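
Figure 2 is not reproduced in this section; the sketch below shows one plausible way to label a common peptide with its origin and location. It assumes exact string matching, 0-based start positions and a mapping from protein identifier to sequence. The function name `locate_peptide` and the toy proteins are hypothetical.

```python
def locate_peptide(peptide, proteins):
    """Return (protein_id, start) for every occurrence, including overlapping ones."""
    hits = []
    for protein_id, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            hits.append((protein_id, start))
            # Resume one residue later so that overlapping occurrences are not missed.
            start = sequence.find(peptide, start + 1)
    return hits


# Toy proteins, invented for illustration (assumption, not from the source).
proteins = {"orgA_p1": "MKSGSGKSTLAA", "orgB_p7": "MSGSGKSTLQ"}
print(locate_peptide("SGSGKSTL", proteins))
```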

9. A method as claimed in claim 1 wherein the said method of creating a common peptide of 
variable length after removing the overlaps as given in step (v) of claim 1 is carried out by 
following the steps given in figure 3. 
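
Figure 3 is not reproduced in this section. As a sketch of the joining step, the code below merges overlapping windows of length N that were located in one protein into maximal stretches. It assumes 0-based start positions and treats two windows as overlapping only when they share at least one residue. The function name `stitch` and the toy sequence are hypothetical.

```python
def stitch(sequence, starts, n):
    """Merge windows of length n beginning at `starts` into maximal overlapping stretches."""
    stretches = []
    begin = end = None
    for start in sorted(starts):
        if begin is None:
            begin, end = start, start + n
        elif start < end:
            # Window overlaps the current stretch: extend it.
            end = max(end, start + n)
        else:
            # Gap found: close the current stretch and open a new one.
            stretches.append(sequence[begin:end])
            begin, end = start, start + n
    if begin is not None:
        stretches.append(sequence[begin:end])
    return stretches


# Toy sequence, invented for illustration (assumption, not from the source).
print(stitch("MKGPPGTGKTLLAQW", [2, 3, 4, 5], 8))
```

Four consecutive windows of length 8 collapse into a single stretch of 11 residues here, which is how common peptides of fixed length can give rise to invariant peptides of variable length.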

10. A microprocessor based system for performing the methods of the invention which comprises: 

i) means of determining the amino acid sequence window for creation of peptide library and 
subsequent origin tagging, 

ii) means of comparing the peptide library, 

iii) locating computationally these common peptides in the original proteins and subsequently 
labeling them with their origin and location, 

iv) joining computationally the overlapping common peptides to obtain a long chain of invariant 
peptide sequences. 

11. A computer based system for performing the methods of the invention further comprising a 
central processing unit, executing peptide library creating program (PEPLIB), peptide library 
matching program (PEPLIMP), peptide stitching program (PEPSTITCH), peptide extraction 
program (PEPXTRACT) wherein the said programs are all stored in a memory device accessed 
by the central processing unit connected to a display on which the central processing unit 
displays the screens of the above mentioned programs in response to user inputs with a user 
interface device. 

12. A method for assigning function to a protein of unknown function showing no/weak homology 
to other protein sequences in a publicly available database (SWISSPROT) by employing the 
following steps: 

I. generating computationally overlapping peptide library from the protein sequences 
of unknown function, 

II. sorting computationally the peptides of length 'N' (N is the length of the sliding 
window of amino acids) obtained as above, alphabetically, according to single letter 
amino acid code, 

III. matching computationally the current library with peptide library of all functionally 
known proteins to obtain common peptides, 

IV. locating computationally these common peptides in the original proteins and 
subsequently labeling them with their origin and location, 

V. joining computationally the overlapping common peptides to obtain a long chain of 
invariant peptide sequences, 

VI. assigning function to the unknown protein based on the function of the protein with 
which maximum length of peptide sequence identity is found. The greater the number of 
matches with proteins of similar function, the higher the likelihood of the functional 
assignment. 
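
The final assignment step can be sketched as follows. The sketch assumes that the earlier steps have produced, for one unknown protein, a list of pairs giving the function of a known protein and the stitched peptide shared with it. It reports the function attached to the longest shared peptide together with the number of matches that carry the same function. The function name `assign_function` and the function labels are hypothetical; ties are resolved in favour of the first longest match, which is an arbitrary choice.

```python
from collections import Counter


def assign_function(matches):
    """matches: list of (function_label, stitched_peptide) for one unknown protein.

    Returns (function of the longest identical stretch, number of matches
    supporting that function), or None when there are no matches.
    """
    if not matches:
        return None
    # Longest stretch of exact identity decides the assignment.
    best_function, _ = max(matches, key=lambda m: len(m[1]))
    # More matches with the same function suggest a more likely assignment.
    support = Counter(function for function, _ in matches)
    return best_function, support[best_function]


# Toy matches with invented labels (assumption, not from the source).
matches = [
    ("ABC transporter", "SGSGKSTL"),
    ("elongation factor", "AHIDAGKTTT"),
    ("elongation factor", "MGHVDHGKT"),
]
print(assign_function(matches))
```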
A computer based method for identifying peptides 
useful as drug targets 



Abstract 

The present invention relates to a novel computer based method for performing genome-wise 
comparison of several organisms. The computational method involves creation of peptide 
libraries from protein sequences of several organisms and their subsequent comparison, leading to 
identification of conserved invariant peptide motifs. To this end, several invariant peptide motifs 
have been identified by direct sequence comparison between various bacterial organisms and host 
genomes without any a priori assumptions. The present method is useful for identification of 
potential drug targets, can serve as a drug screen for broad-spectrum antibacterials as well as for 
specific diagnosis of infections, and can in addition be used for assignment of function to proteins 
of yet unknown function with the help of such invariant peptide motif signatures. 
PEPSTITCH 